We introduce an approach to inferring the causal architecture of stochastic dynamical systems 
that extends rate distortion theory to use causal shielding — a natural principle of learning. We 
study two distinct cases of causal inference: optimal causal filtering and optimal causal estimation. 

Filtering corresponds to the ideal case in which the probability distribution of measurement 
sequences is known, giving a principled method to approximate a system's causal structure at a 
desired level of representation. We show that, in the limit in which a model complexity constraint is 
relaxed, filtering finds the exact causal architecture of a stochastic dynamical system, known as the 
causal-state partition. From this, one can estimate the amount of historical information the process 
stores. More generally, causal filtering finds a graded model-complexity hierarchy of approximations 
to the causal architecture. Abrupt changes in the hierarchy, as a function of approximation, capture 
distinct scales of structural organization. 

For nonideal cases with finite data, we show how the correct number of underlying causal states 
can be found by optimal causal estimation. A previously derived model complexity control term 
allows us to correct for the effect of statistical fluctuations in probability estimates and thereby 
avoid over-fitting. 

PACS numbers: 02.50.-r 89.70.+C 05.45.-a 05.45.Tp 



Natural systems compute intrinsically and pro- 
duce information. This organization, often only 
indirectly accessible to an observer, is reflected 
to varying degrees in measured time series. 
Nonetheless, this information can be used to 
build models of varying complexity that capture 
the causal architecture of the underlying system 
and allow one to estimate its information process- 
ing capabilities. We investigate two cases. The 
first is when a model builder wishes to find a more 
compact representation than the true one. This 
occurs, for example, when one is willing to incur 
the cost of a small increase in error for a large 
reduction in model size. The second case con- 
cerns the empirical setting in which only a finite 
amount of data is available. There one wishes 
to avoid over-fitting a model to a particular data 
set. 



I. INTRODUCTION 

Time series modeling has a long and important his- 
tory in science and engineering. Advances in dynamical 
systems over the last half century led to new methods 
that attempt to account for the inherent nonlinearity in
many natural phenomena [1-7]. As a result, it is now
well known that nonlinear systems produce highly correlated
time series that are not adequately modeled under
the typical statistical assumptions of linearity, independence,
and identical distributions. One consequence, exploited
in novel state-space reconstruction methods [8-10],
is that discovering the hidden structure of such processes
is key to successful modeling and prediction [11-14].
In an attempt to unify the alternative nonlinear
modeling approaches, computational mechanics [15] introduced
a minimal representation, the ε-machine, for
stochastic dynamical systems that is an optimal predictor 
and from which many system properties can be directly 
calculated. Building on the notion of state introduced in 
Ref. [5], a system's effective states are those variables 
that causally shield a system's past from its future — 
capturing, in the present, information from the past that 
predicts the future. 

Following these lines, here we investigate the problem 
of learning predictive models of time series with particu- 
lar attention paid to discovering hidden variables. We do 
this by using the information bottleneck method (IB) [16]
together with a complexity control method discussed by
Ref. [17], which is necessary for learning from finite data.
Ref. [18] lays out the relationship between computational
mechanics and the information bottleneck method. Here,
we make the mathematical connection for time series,
introducing a new method.

We adapt IB to time series prediction, resulting in a
method we call optimal causal filtering (OCF) [44]. Since
OCF, in effect, extends rate-distortion theory [19] to use
causal shielding, in general it achieves an optimal balance
between model complexity and approximation accuracy.
The implications of these trade-offs for automated theory
building are discussed in Ref. [20].

We show that in the important limit in which predic- 
tion is paramount and model complexity is not restricted, 
OCF reconstructs the underlying process's causal archi- 
tecture, as previously defined within the framework of 
computational mechanics [15, 21, 22]. This shows that,
in effect, OCF captures a source's hidden variables and 
organization. The result gives structural meaning to the 
inferred models. For example, one can calculate fun- 
damental invariants — such as, symmetries, entropy rate, 
and stored information — of the original system. 

To handle finite-data fluctuations, OCF is extended to 
optimal causal estimation (OCE). When probabilities are 
estimated from finite data, errors due to statistical fluc- 
tuations in probability estimates must be taken into ac- 
count in order to avoid over-fitting. We demonstrate how 
OCF and OCE work on a number of example stochastic
processes with known, nontrivial correlational structure. 

II. CAUSAL STATES 

Assume that we are given a stochastic process $P(\overleftrightarrow{X})$:
a joint distribution over a bi-infinite sequence $\overleftrightarrow{X} = \overleftarrow{X} \overrightarrow{X}$
of random variables. The past, or history, is denoted
$\overleftarrow{X} = \ldots X_{-3} X_{-2} X_{-1}$, while $\overrightarrow{X} = X_0 X_1 X_2 \ldots$ denotes the
future [45]. Here, the random variables $X_t$ take on discrete
values $x \in \mathcal{A} = \{1, 2, \ldots, k\}$ and the process as a
whole is stationary. The following assumes the reader is
familiar with information theory and the notation of Ref.
[23].

Within computational mechanics, a process $P(\overleftrightarrow{X})$ is
viewed as a communication channel that transmits information
from the past to the future, storing information
in the present, presumably in some internal states,
variables, or degrees of freedom [23]. One can ask a simple
question, then: how much information does the past
share with the future? A related and more demanding 
question is how we can infer a predictive model, given 
the process. Many authors have considered such ques- 
tions. Refs. jT5J [551 [25J [2S] review some of the related 
literature. 

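The effective, or causal, states $\mathcal{S}$ are determined by
an equivalence relation $\overleftarrow{x} \sim \overleftarrow{x}'$ that groups all histories
together which give rise to the same prediction of the
future [15, 22]. The equivalence relation partitions the
space $\overleftarrow{\mathbf{X}}$ of histories and is specified by the set-valued
function:

$$\epsilon(\overleftarrow{x}) = \{ \overleftarrow{x}' : P(\overrightarrow{X} | \overleftarrow{x}) = P(\overrightarrow{X} | \overleftarrow{x}') \} \tag{1}$$

that maps from an individual history to the equivalence
class $\sigma \in \mathcal{S}$ containing that history and all others which
lead to the same prediction $P(\overrightarrow{X} | \overleftarrow{x})$ of the future. A
causal state $\sigma$ includes: (i) a label $\sigma \in \mathcal{S}$; (ii) a set of
histories $\overleftarrow{\mathbf{X}}_\sigma = \{ \overleftarrow{x} : P(\overrightarrow{X} | \overleftarrow{x}) = P(\overrightarrow{X} | \sigma) \} \subset \overleftarrow{\mathbf{X}}$; and (iii)
a future conditional distribution $P(\overrightarrow{X} | \sigma)$ given the state
[15, 22].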
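As a minimal sketch of the equivalence relation in Eq. (1), assuming finite-length histories and a hand-made table of future conditional distributions, histories can be grouped by the prediction they give. The table `toy_cond`, the function name `causal_state_partition`, and the rounding tolerance are illustrative assumptions, not part of the original method.

```python
from collections import defaultdict

def causal_state_partition(cond, decimals=12):
    """Group histories whose future conditional distributions coincide (Eq. 1).

    cond maps a history label to a tuple of probabilities P(future | history).
    Returns a list of sets of histories, one set per equivalence class.
    """
    classes = defaultdict(set)
    for history, dist in cond.items():
        # Round so that floating-point noise does not split a class.
        key = tuple(round(p, decimals) for p in dist)
        classes[key].add(history)
    return list(classes.values())

# Hypothetical toy table: futures are the next binary symbol (0, 1).
toy_cond = {
    "010": (0.0, 1.0),
    "110": (0.0, 1.0),
    "011": (0.5, 0.5),
    "101": (0.5, 0.5),
    "111": (0.5, 0.5),
}
partition = causal_state_partition(toy_cond)
print(partition)
```

In this toy table the five histories fall into two classes, so two effective states suffice to reproduce every prediction.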

Any alternative model, called a rival $\mathcal{R}$, gives a probabilistic
assignment $P(\mathcal{R} | \overleftarrow{x})$ of histories to its states
$\rho \in \mathcal{R}$. Due to the data processing inequality, a model
can never capture more information about the future
than shared between past and future:

$$I[\mathcal{R}; \overrightarrow{X}] \leq I[\overleftarrow{X}; \overrightarrow{X}] , \tag{2}$$

where $I[V; W]$ denotes the mutual information between
random variables $V$ and $W$ [23]. The quantity $\mathbf{E} = I[\overleftarrow{X}; \overrightarrow{X}]$
has been studied by several authors and given different
names, such as (in chronological order) convergence
rate of the conditional entropy [27], excess entropy
[21], stored information [25], effective measure complexity
[30], past-future mutual information [31], and predictive
information [32], amongst others. For a review see
Ref. [25] and references therein.

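The causal states $\sigma \in \mathcal{S}$ are distinguished by the fact
that the function $\epsilon(\cdot)$ gives rise to a deterministic assignment
of histories to states:

$$P(\sigma | \overleftarrow{x}) = \delta_{\sigma, \epsilon(\overleftarrow{x})} \tag{3}$$

and, furthermore, by the fact that their future conditional
probabilities are given by

$$P(\overrightarrow{X} | \sigma) = P(\overrightarrow{X} | \overleftarrow{x}) , \tag{4}$$

for all $\overleftarrow{x}$ such that $\epsilon(\overleftarrow{x}) = \sigma$. As a consequence, the
causal states, considered as a random variable $\mathcal{S}$, capture
the full predictive information

$$I[\mathcal{S}; \overrightarrow{X}] = I[\overleftarrow{X}; \overrightarrow{X}] = \mathbf{E} . \tag{5}$$

More to the point, they causally shield the past and
future: the past and future are independent given the
causal state, $P(\overleftarrow{X}, \overrightarrow{X} | \mathcal{S}) = P(\overleftarrow{X} | \mathcal{S}) P(\overrightarrow{X} | \mathcal{S})$.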

The causal-state partition has, out of all equally predictive
partitions, called prescient rivals $\widehat{\mathcal{R}}$ [33], the smallest
entropy, $C_\mu[\widehat{\mathcal{R}}] = H[\widehat{\mathcal{R}}]$:

$$H[\widehat{\mathcal{R}}] \geq H[\mathcal{S}] , \tag{6}$$

known as the statistical complexity, $C_\mu := H[\mathcal{S}]$. This is the
amount of historical information a process stores: A process
communicates $\mathbf{E}$ bits from the past to the future by
storing $C_\mu$ bits in the present. $C_\mu$ is one of a process's key
properties; the other is its entropy rate [23]. Finally, the
causal states are unique and minimal sufficient statistics
for prediction of the time series [15, 22].
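The relations in Eqs. (2), (5), and (6) can be checked numerically on a small example. The following is a sketch under stated assumptions: the five-history distribution `p_hist`, the table `cond`, and the state labels `state_of` are hypothetical toy inputs, and futures are truncated to a single symbol.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint):
    """I[A;B] in bits from a joint table (rows indexed by A, columns by B)."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

# Hypothetical toy process: five histories, one binary future symbol.
p_hist = np.array([1, 1, 1, 2, 1]) / 6.0
cond = np.array([[0.0, 1.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
state_of = np.array([0, 0, 1, 1, 1])          # causal-state label of each history

joint_hist = p_hist[:, None] * cond           # P(history, future)
onehot = np.eye(2)[state_of]                  # deterministic P(state | history)
joint_state = onehot.T @ joint_hist           # P(state, future)

E = mutual_information(joint_hist)            # information shared by past and future
I_state = mutual_information(joint_state)     # information captured by the states
C_mu = entropy(joint_state.sum(axis=1))       # entropy of the state distribution
I_coarse = mutual_information(joint_hist.sum(axis=0, keepdims=True))  # one-state rival

print(E, I_state, C_mu, I_coarse)
```

The states capture all of the predictive information while storing less than the entropy of the histories; the single-state rival captures none, in line with the data processing bound.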

III. CONSTRUCTING CAUSAL MODELS OF 
INFORMATION SOURCES 

Continuing with the communication channel analogy 
above, models, optimal or not, can be broadly consid- 
ered to be a lossy compression of the original data. A 
model captures some regularity while making some errors
in describing the data. Rate distortion theory [19]
gives a principled method to find a lossy compression 
of an information source such that the resulting model 
is as faithful as possible to the original data, quanti- 
fied by a distortion function. The specific form of the 
distortion function determines what is considered to be 
"relevant" — kept in the compressed representation — and 
what is "irrelevant" — can be discarded. Since there is no 
universal distortion function, it has to be assumed ad hoc 
for each application. The information bottleneck method 
[16] argues for explicitly keeping the relevant information,
defined as the mutual information that the data
share with a desired relevant variable [IB]. With those 
choices, the distortion function can be derived from the 
optimization principle, but the relevant variable has to 
be specified a priori. 

In time series modeling, however, there is a natural 
notion of relevance: the future data. For stationary time 
series, moreover, building a model with low generalization
error is equivalent to constructing a model that accurately
predicts future data from past data. These observations
lead directly to an information-theoretic specification
for reconstructing time series models: First, introduce
general model variables $\mathcal{R}$ that can store, in the
present moment, the information transmitted from the
past to the future. Any set of such variables specifies a
stochastic partition of $\overleftarrow{\mathbf{X}}$ via a probabilistic assignment
rule $P(\mathcal{R} | \overleftarrow{x})$. Second, require that this partition be maximally
predictive. That is, it should maximize the information
$I[\mathcal{R}; \overrightarrow{X}]$ that the variables $\mathcal{R}$ contain about the
future $\overrightarrow{X}$. Third, the so-constructed representation of the
historical data should be a summary, i.e., it should not
contain all of the historical information, but rather, as
little as possible while still capturing the predictive information.
The information kept about the past, $I[\overleftarrow{X}; \mathcal{R}]$,
the coding rate, measures the model complexity or bit
cost. Intuitively, one wants to find the most predictive
model at fixed complexity or, vice versa, the least complex
model at fixed prediction accuracy. These criteria
are equivalent, in effect, to causal shielding.

Writing this intuition formally reduces to the information
bottleneck method, where the relevant information
is information about the future. The constrained optimization
problem one has to solve is:

$$\max_{P(\mathcal{R} | \overleftarrow{X})} \left\{ I[\mathcal{R}; \overrightarrow{X}] - \lambda I[\overleftarrow{X}; \mathcal{R}] \right\} , \tag{7}$$

where the parameter $\lambda$ controls the balance between prediction
and model complexity. The linear trade-off that $\lambda$
represents is an ad hoc assumption [18]. Its justification
is greatly strengthened in the following by the rigorous
results showing it leads to the causal states and the successful
quantitative applications.

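The optimization problem of Eq. (7) is solved subject
to the normalization constraint $\sum_{\rho \in \mathcal{R}} P(\rho | \overleftarrow{x}) = 1$,
for all $\overleftarrow{x} \in \overleftarrow{\mathbf{X}}$. It then has a family of solutions [16],
parametrized by the Lagrange multiplier $\lambda$, that gives
the following optimal assignments of histories $\overleftarrow{x}$ to states
$\rho \in \mathcal{R}$:

$$P_{\rm opt}(\rho | \overleftarrow{x}) = \frac{P(\rho)}{Z} \exp\left( -\frac{1}{\lambda} \mathcal{D}\left( P(\overrightarrow{X} | \overleftarrow{x}) \,\|\, P(\overrightarrow{X} | \rho) \right) \right) \tag{8}$$

with

$$P(\overrightarrow{X} | \rho) = \frac{1}{P(\rho)} \sum_{\overleftarrow{x} \in \overleftarrow{\mathbf{X}}} P(\overrightarrow{X} | \overleftarrow{x}) P(\rho | \overleftarrow{x}) P(\overleftarrow{x}) \quad \text{and} \tag{9}$$

$$P(\rho) = \sum_{\overleftarrow{x} \in \overleftarrow{\mathbf{X}}} P(\rho | \overleftarrow{x}) P(\overleftarrow{x}) , \tag{10}$$

where $Z$ normalizes the assignments over $\rho$ and $\mathcal{D}(P \| Q)$ is the information gain [23] between distributions
$P$ and $Q$. In the solution it plays the role of
an "energy", effectively measuring how different the predicted
and true futures are. The more distinct, the more
information one gains about the probabilistic development
of the future from the past. That is, high energy
models make predictions that deviate substantially from
the process.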

These self-consistent equations are solved iteratively
[16] using a procedure similar to the Blahut-Arimoto algorithm
[34, 35]. A connection to statistical mechanics
is often drawn, and the parameter $\lambda$ is identified with a
(pseudo) temperature that controls the level of randomness;
see, e.g., Ref. [36]. This is useful to guide intuition
and, for example, has inspired deterministic annealing
[37].
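One way to picture the iteration of Eqs. (8) to (10) is the following minimal sketch. It assumes finite sets of histories and futures supplied as arrays. The function names (`kl_matrix`, `ocf_iterate`, `predicted_information`), the random initialization, the fixed iteration count, the natural logarithm inside the exponent, and the toy inputs are all illustrative choices rather than details taken from the text.

```python
import numpy as np

def kl_matrix(p, q):
    """D(p_i || q_j) in nats for every pair of rows (inf if q_j misses support of p_i)."""
    pi, qj = p[:, None, :], q[None, :, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(pi > 0, pi * (np.log(pi) - np.log(qj)), 0.0)
    return terms.sum(axis=2)

def ocf_iterate(p_hist, cond, lam, n_states, n_iter=200, seed=0):
    """Iterate Eqs. (8)-(10) at fixed lam; returns P(rho | history) as an array."""
    rng = np.random.default_rng(seed)
    assign = rng.random((len(p_hist), n_states)) + 0.1   # soft random start
    assign /= assign.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_state = assign.T @ p_hist                                          # Eq. (10)
        with np.errstate(divide="ignore", invalid="ignore"):
            morphs = (assign * p_hist[:, None]).T @ cond / p_state[:, None]  # Eq. (9)
            log_w = np.log(p_state)[None, :] - kl_matrix(cond, np.nan_to_num(morphs)) / lam
        log_w -= log_w.max(axis=1, keepdims=True)        # stabilize the exponent
        assign = np.exp(log_w)                           # Eq. (8), unnormalized
        assign /= assign.sum(axis=1, keepdims=True)      # normalization Z
    return assign

def predicted_information(p_hist, cond, assign):
    """I[R; future] in bits for a soft assignment P(rho | history)."""
    joint = (assign * p_hist[:, None]).T @ cond
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / outer[mask])).sum())

# Hypothetical toy input: five histories, one binary future symbol.
p_hist = np.array([1, 1, 1, 2, 1]) / 6.0
cond = np.array([[0.0, 1.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])

hot = ocf_iterate(p_hist, cond, lam=100.0, n_states=3)    # compression dominates
cold = ocf_iterate(p_hist, cond, lam=0.05, n_states=3)    # prediction dominates
full = predicted_information(p_hist, cond, np.eye(len(p_hist)))
print(predicted_information(p_hist, cond, hot), predicted_information(p_hist, cond, cold), full)
```

At large $\lambda$ the assignments stop depending on the history, so the predicted information collapses toward zero. For any $\lambda$ it cannot exceed the information that the histories themselves carry about the future, Eq. (2).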

We are now ready for the first observation. 

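Proposition 1. In the low-temperature regime ($\lambda \to 0$)
the assignments of pasts to states become deterministic
and are given by:

$$P_{\rm opt}(\rho | \overleftarrow{x}) = \delta_{\rho, \eta(\overleftarrow{x})} , \quad \text{where} \tag{11}$$

$$\eta(\overleftarrow{x}) = \arg\min_{\rho} \mathcal{D}\left( P(\overrightarrow{X} | \overleftarrow{x}) \,\|\, P(\overrightarrow{X} | \rho) \right) . \tag{12}$$

Proof. Define the quantity

$$D(\rho) = \mathcal{D}\left( P(\overrightarrow{X} | \overleftarrow{x}) \,\|\, P(\overrightarrow{X} | \rho) \right) - \mathcal{D}\left( P(\overrightarrow{X} | \overleftarrow{x}) \,\|\, P(\overrightarrow{X} | \eta(\overleftarrow{x})) \right) . \tag{13}$$

For $\rho \neq \eta(\overleftarrow{x})$, $D(\rho)$ is positive, by definition Eq. (12) of $\eta(\overleftarrow{x})$. Now,
write

$$P_{\rm opt}(\eta(\overleftarrow{x}) | \overleftarrow{x}) = \frac{1}{1 + \sum_{\rho \neq \eta(\overleftarrow{x})} \frac{P(\rho)}{P(\eta(\overleftarrow{x}))} \exp\left( -\frac{D(\rho)}{\lambda} \right)} . \tag{14}$$

The sum on the r.h.s. tends to zero, as $\lambda \to 0$, assuming
that $P(\eta(\overleftarrow{x})) > 0$. Via normalization, the assignments
become deterministic. □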
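The limit in Proposition 1 can be seen numerically for a single history. In this sketch the state weights `p_state` and the information gains `divergences` are hypothetical numbers held fixed while $\lambda$ is lowered; in the full procedure they would themselves be updated through Eqs. (9) and (10).

```python
import numpy as np

def soft_assignment(p_state, divergences, lam):
    """P_opt(rho | history) from Eq. (8) for one history, at fixed P(rho) and divergences."""
    log_w = np.log(p_state) - np.asarray(divergences) / lam
    log_w -= log_w.max()              # avoid underflow before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

p_state = np.array([0.5, 0.3, 0.2])            # hypothetical P(rho)
divergences = np.array([0.30, 0.10, 0.50])     # hypothetical information gains (nats)

for lam in (10.0, 1.0, 0.1, 0.01):
    print(lam, soft_assignment(p_state, divergences, lam).round(6))
```

At high temperature the assignment follows the prior weights $P(\rho)$; as $\lambda$ shrinks, all weight moves to the state with the smallest information gain, which is the $\eta(\overleftarrow{x})$ of Eq. (12).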



IV. OPTIMAL CAUSAL FILTERING 

We now establish the procedure's fundamental prop- 
erties by connecting the solutions it determines to 
the causal representations defined previously within the 
framework of computational mechanics. The resulting 
procedure transforms the original data to a causal repre- 
sentation and so we call it optimal causal filtering (OCF). 

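Note first that for deterministic assignments we have
$H[\mathcal{R} | \overleftarrow{X}] = 0$. Therefore, the information about the past
becomes $I[\overleftarrow{X}; \mathcal{R}] = H[\mathcal{R}]$ and the objective function simplifies
to

$$F_{\rm det}[\mathcal{R}] = I[\mathcal{R}; \overrightarrow{X}] - \lambda H[\mathcal{R}] . \tag{15}$$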



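Lemma 1. Within the subspace of prescient rivals, the
causal-state partition maximizes $F_{\rm det}[\widehat{\mathcal{R}}]$.

Proof. This follows immediately from Eqs. (5) and (6).
They imply that

$$F_{\rm det}[\widehat{\mathcal{R}}] = I[\mathcal{S}; \overrightarrow{X}] - \lambda H[\widehat{\mathcal{R}}] \leq I[\mathcal{S}; \overrightarrow{X}] - \lambda H[\mathcal{S}] = F_{\rm det}[\mathcal{S}] . \tag{16}$$
□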



The causal-state partition is the model with the largest
value of the OCF objective function, because it is fully
predictive at minimum complexity. We also know from
Prop. 1 that in the low-temperature limit ($\lambda \to 0$) OCF
recovers a deterministic mapping of histories to states.
We now show that this mapping is exactly the causal-state
partition of histories.

Theorem 1. OCF finds the causal-state partition of $\overleftarrow{\mathbf{X}}$
in the low-temperature limit, $\lambda \to 0$.

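Proof. The causal-state partition, Eq. (1), always exists,
and implies that there are groups of histories with

$$P(\overrightarrow{X} | \overleftarrow{x}) = P(\overrightarrow{X} | \epsilon(\overleftarrow{x})) . \tag{17}$$

We then have, for all $\overleftarrow{x} \in \overleftarrow{\mathbf{X}}$,

$$\mathcal{D}\left( P(\overrightarrow{X} | \overleftarrow{x}) \,\|\, P(\overrightarrow{X} | \epsilon(\overleftarrow{x})) \right) = 0 , \tag{18}$$

and, hence,

$$\epsilon(\overleftarrow{x}) = \arg\min_{\rho} \mathcal{D}\left( P(\overrightarrow{X} | \overleftarrow{x}) \,\|\, P(\overrightarrow{X} | \rho) \right) . \tag{19}$$

Therefore, we can identify $\epsilon(\overleftarrow{x}) = \eta(\overleftarrow{x})$ in Eq. (12),
and so the assignment of histories to the causal states is
recovered by OCF:

$$P_{\rm opt}(\rho | \overleftarrow{x}) = \delta_{\rho, \epsilon(\overleftarrow{x})} . \tag{20}$$
□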



Note that we have not restricted the size of the set $\mathcal{R}$ of
model states. Recall also that the causal-state partition is
unique. The Lemma establishes that OCF does not
find prescient rivals in the low-temperature limit. The
prescient rivals are suboptimal in the particular sense
that they have smaller values of the objective function.
We now establish that this difference is controlled by the
model size with proportionality constant $\lambda$.

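Corollary 1. Prescient rivals are suboptimal in OCF.
The value of the objective function evaluated for a prescient
rival is smaller than that evaluated for the causal-state
model. The difference $\Delta F_{\rm det}[\widehat{\mathcal{R}}] = F_{\rm det}[\mathcal{S}] - F_{\rm det}[\widehat{\mathcal{R}}]$
is given by:

$$\Delta F_{\rm det}[\widehat{\mathcal{R}}] = \lambda \left( C_\mu[\widehat{\mathcal{R}}] - C_\mu[\mathcal{S}] \right) . \tag{21}$$

Proof.

$$\Delta F_{\rm det}[\widehat{\mathcal{R}}] = F_{\rm det}[\mathcal{S}] - F_{\rm det}[\widehat{\mathcal{R}}] \tag{22}$$
$$= I[\mathcal{S}; \overrightarrow{X}] - I[\widehat{\mathcal{R}}; \overrightarrow{X}] - \lambda H[\mathcal{S}] + \lambda H[\widehat{\mathcal{R}}] \tag{23}$$
$$= \lambda \left( C_\mu[\widehat{\mathcal{R}}] - C_\mu[\mathcal{S}] \right) . \tag{24}$$

Moreover, Eq. (6) implies that $\Delta F_{\rm det} \geq 0$. □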



So, we see that for $\lambda = 0$, causal states and all other
prescient rival partitions are degenerate. This is to be
expected as at $\lambda = 0$ the model-complexity constraint
disappears. Importantly, this means that maximizing
the predictive information alone, without the appropriate
constraint on model complexity, does not suffice to
recover the causal-state partition.
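The gap between a prescient rival and the causal-state partition, Eq. (21), can be illustrated with a short numerical check. This is a sketch under assumptions: the toy arrays `p_hist` and `cond`, the label vectors, and the helper `f_det` are hypothetical. The rival used here keeps every history in its own state, which is prescient because it discards nothing.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def f_det(p_hist, cond, labels, lam):
    """Return (F_det, H[R]) of Eq. (15) for a deterministic partition given by labels."""
    labels = np.asarray(labels)
    onehot = np.eye(labels.max() + 1)[labels]
    joint = (onehot * p_hist[:, None]).T @ cond          # P(state, future)
    p_state = joint.sum(axis=1)
    info = entropy(p_state) + entropy(joint.sum(axis=0)) - entropy(joint)
    return info - lam * entropy(p_state), entropy(p_state)

# Hypothetical toy input: five histories, one binary future symbol.
p_hist = np.array([1, 1, 1, 2, 1]) / 6.0
cond = np.array([[0.0, 1.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
causal_labels = [0, 0, 1, 1, 1]       # histories grouped by identical predictions
rival_labels = [0, 1, 2, 3, 4]        # prescient rival: one state per history

lam = 0.5
F_causal, H_causal = f_det(p_hist, cond, causal_labels, lam)
F_rival, H_rival = f_det(p_hist, cond, rival_labels, lam)
print(F_causal - F_rival, lam * (H_rival - H_causal))
```

Both partitions capture the same predictive information, so the difference in objective values is carried entirely by the entropy term and vanishes when $\lambda = 0$.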



V. EXAMPLES 

We study how OCF works on a series of example 
stochastic processes of increasing statistical sophistication.
We compute the optimal solutions and visualize the
trade-off between predictive power and complexity of the
model by tracing out a curve similar to a rate-distortion
curve [19, 35]: For each value of $\lambda$, we evaluate both
the model's coding rate $I[\overleftarrow{X}; \mathcal{R}]$ and its predicted information
$I[\mathcal{R}; \overrightarrow{X}]$ at the optimal solution and plot them
against each other. The resulting curve in the information
plane [16] separates the feasible from the infeasible
region: It is possible to find a model that is more complex
at the same prediction error, but not possible to find
a less complex model than that given by the optimum.
In analogy to a rate-distortion curve, we can read off the
maximum amount of information about the future that
can be captured with a model of fixed complexity. Or,
conversely, we can read off the smallest representation at
fixed predictive power.

The examples in this and the following sections are calculated
by solving the self-consistent Eqs. (8) to (10) iteratively
[16] at each value of $\lambda$. To trace out the curves, a
deterministic annealing [37] scheme is implemented, lowering
$\lambda$ by a fixed annealing rate. Smaller rates cost more
computational time, but allow one to compute the rate-distortion
curve in greater detail, while larger rates result
in a rate-distortion curve that gets evaluated in fewer
places and hence looks coarser. In examples, naturally,
one can only work with finite length past and future sequences:
$\overleftarrow{x}^K$ and $\overrightarrow{x}^L$, where $K$ and $L$ give their lengths,
respectively.
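A cooling schedule of the kind described can be sketched as a simple generator. The text does not spell out how the annealing rate enters the update, so the geometric rule used here ($\lambda$ multiplied by a constant `factor` at every step) is an assumption, as are the names `annealing_schedule`, `lam_start`, and `lam_stop`. In this sketch a factor closer to 1 means slower cooling and therefore more points along the curve.

```python
def annealing_schedule(lam_start, lam_stop, factor):
    """Yield lambda values lowered geometrically until they drop below lam_stop."""
    if not (0.0 < factor < 1.0):
        raise ValueError("factor must lie strictly between 0 and 1")
    lam = lam_start
    while lam >= lam_stop:
        yield lam
        lam *= factor          # assumed multiplicative cooling step

coarse = list(annealing_schedule(1.0, 0.5, 0.9))
fine = list(annealing_schedule(1.0, 0.5, 0.99))
print(len(coarse), len(fine))
```

Each yielded value would be handed to the iterative solver, with the previous solution as the starting point, so that the curve in the information plane is traced out point by point.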



A. Periodic limit cycle: A predictable process 

We start with an example of an exactly periodic process,
a limit cycle oscillation. It falls in the class of deterministic
and time reversible processes, for which the
rate-distortion curve can be computed analytically: it
lies on the diagonal [20]. We demonstrate this with a
numerical example. Figure 1 shows how OCF works on a
period-four process: $(0011)^\infty$. (See Figs. 1 and 2.) There
are exactly two bits of predictive information $I[\overleftarrow{X}; \overrightarrow{X}]$ to
be captured about future words of length two (dotted
horizontal line). This information describes the phase
of the period-four cycle. To capture those two bits, one
needs exactly four underlying causal states and a model
complexity of $C_\mu = 2$ bits (dotted vertical line).

The curve is the analog of a rate-distortion curve, except
that the information plane swaps the horizontal and
vertical axes (the coding rate and distortion axes). (See
Ref. [20] for the direct use of the rate-distortion curve.)
The value of $I[\mathcal{R}; \overrightarrow{X}^2]$ (the "distortion"), evaluated at the
optimal distribution, Eq. (8), is plotted versus $I[\overleftarrow{X}^3; \mathcal{R}]$
(the "code rate"), also evaluated at the optimum. Those
are plotted for different values of $\lambda$ and, to trace out the
curve, deterministic annealing is implemented. At large
$\lambda$, we are in the lower left of the curve: the compression
is extreme, but no predictive information is captured. A
single state model, a fair coin, is found as expected. As
$\lambda$ decreases (moving to the right), the next distinct point
on the curve is for a two-state model, which discards half
of the information. This comes exactly at the cost of one
predictive bit. Finally, OCF finds a four-state model that
captures all of the predictive information at no compression.
The numbers next to the curve indicate the first
time that the effective number of states increases to that
value.

The four-state model captures the two bits of predictive
information. But compressed to one bit (using two
states), one can only capture one bit of predictive information.
The information curve falls onto the diagonal, a
straight line that is the worst case for possible beneficial
trade-offs between prediction error and model complexity
[20].

In Fig. 2 we show the best two-state model compared
to the full (exact) four-state model. One of the future
conditional probabilities captures zero probability events
of "odd" {01, 10} words, assigning equal probability to
the "even" {00, 11} words. The other one captures zero
probability events of even words, assigning equal probability
to the odd words. This captures the fundamental
determinism of the process: an odd word never follows
an even word and vice versa. The overall result illustrates
how the actual long-range correlation in the completely
predictable period-4 sequence is represented by a
smaller stochastic model. While in the four-state model
the future conditional probabilities are δ-functions, in the
two-state approximate model they are mixtures of those
δ-functions. In this way, OCF converts structure to randomness
when approximating underlying states with a
compressed model; cf. the analogous trade-off discussed
in Ref. [25].
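The numbers quoted for the period-4 process can be reproduced directly from the sequence. This sketch assumes histories of length K = 3 and futures of length L = 2, as in the figures; the variable names and the way the two-state model is built (merging histories by whether the next word is even or odd) are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint):
    """I[A;B] in bits from a joint table (rows indexed by A, columns by B)."""
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

K, L = 3, 2
sequence = "0011" * 3                       # long enough to read K + L symbols from each phase
futures = ["00", "01", "10", "11"]
histories, joint = [], np.zeros((4, 4))
for phase in range(4):                      # the four phases are equally likely
    histories.append(sequence[phase:phase + K])
    future = sequence[phase + K:phase + K + L]
    joint[phase, futures.index(future)] = 0.25

E = mutual_information(joint)               # four-state model: full predictive information
block_entropy = entropy(joint.sum(axis=1))  # entropy of length-3 histories

# Two-state approximation: label 0 if the next word is even (00 or 11), else 1.
labels = np.array([0 if joint[i].argmax() in (0, 3) else 1 for i in range(4)])
joint_two = np.eye(2)[labels].T @ joint
morphs_two = joint_two / joint_two.sum(axis=1, keepdims=True)
I_two = mutual_information(joint_two)
H_two = entropy(joint_two.sum(axis=1))

print(histories, E, block_entropy, I_two, H_two)
print(morphs_two)
```

Halving the number of states removes one bit of model complexity and exactly one bit of predictive information, which is why the points fall on the diagonal of the information plane.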



B. Golden Mean Process: A Markov chain 

The Golden Mean (GM) Process is a Markov chain
of order one. As an information source, it produces all
binary strings with the restriction that there are never
consecutive 0s. The GM Process generates 0s and 1s with
equal probability, except that once a 0 is generated, a 1
is always generated next. One can write down a simple
two-state Markov chain for this process; see, e.g., Ref. 

m 

Figures 3 and 4 demonstrate how OCF reconstructs
the states of the GM process. Figure 3 shows the behavior
of OCF in the information plane. At very high
temperature ($\lambda \to \infty$, lower left corner of the curve)
compression dominates over prediction and the resulting
model is most compact, with only one effective causal
state. However, it contains no information about the
future and so is a poor predictor. As $\lambda$ decreases (moving
right), OCF reconstructs increasingly more predictive
and more complex models. The curve shows that
the information about the future, contained in the optimal
partition, increases (along the vertical axis) as the
model increases in complexity (along the horizontal axis).
There is a transition to two effective states: the number
2 along the curve denotes the first occurrence of this increase.
As $\lambda \to 0$, prediction comes to dominate and
OCF finds a fully predictive model, albeit one with the
minimal statistical complexity, out of all possible state
partitions that would retain the full predictive information.
The model's complexity, $C_\mu \approx 0.92$ bits, is 41%
of the maximum, which is given by the entropy of all possible
pasts of length 3: $H[\overleftarrow{X}^3] \approx 2.25$ bits. The remainder
(59%) of the information is nonpredictive and has been
filtered out by OCF. Figure 4 shows the future conditional
probabilities, associated with the partition found
by OCF, as $\lambda \to 0$, corresponding to $P(\overrightarrow{X}^2 | \rho)$ (circles).
These future conditional probabilities overlap with the
true (but not known to the algorithm) causal-state future
conditional probabilities $P(\overrightarrow{X}^2 | \sigma)$ (boxes) and so
demonstrate that OCF finds the causal-state partition.

FIG. 1: Model predictability $I[\mathcal{R}; \overrightarrow{X}^L]$ versus model complexity (size) $I[\overleftarrow{X}^K; \mathcal{R}]$ trade-off under OCF for the exactly predictable
period-4 process: $(0011)^\infty$. Monitored in the information plane. The horizontal dashed line is the full predictive information
($\mathbf{E} = I[\overleftarrow{X}^3; \overrightarrow{X}^2] = 2$ bits) and the vertical dashed line is the block entropy ($H[\overleftarrow{X}^3] = 2$ bits), which is also the statistical
complexity $C_\mu$. The data points represent solutions at various $\lambda$. Lines connect them to help guide the eye only. Histories of
length K = 3 were used, along with futures of length L = 2. In this and the following information plane plots, the integer
labels $N_c$ indicate the first point at which the effective number of states used by the model equals $N_c$.

FIG. 2: Morphs $P(\overrightarrow{X}^2 | \cdot)$ for the period-4 process: The 2-state approximation (circles) compared to the δ-function morphs for
the 4 causal states (boxes). The morphs $P(\overrightarrow{X}^2 | \sigma)$ for the two-state approximation are (1/2, 0, 0, 1/2) and (0, 1/2, 1/2, 0) and
for the four-state case (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), and (0, 0, 0, 1). Histories of length K = 3 were used, along with futures
of length L = 2 (crosses).
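The Golden Mean figures quoted in this subsection can be reproduced from a two-state presentation of the process. The sketch below assumes the standard two-state Markov chain: state A emits 0 or 1 with probability 1/2, and state B, reached after a 0, always emits 1. The matrices `T`, the stationary vector `pi`, and the helper names are illustrative.

```python
import itertools
import numpy as np

# T[s][i, j] = P(emit symbol s and move to state j | current state i); states are (A, B).
T = {0: np.array([[0.0, 0.5], [0.0, 0.0]]),
     1: np.array([[0.5, 0.0], [1.0, 0.0]])}
pi = np.array([2.0, 1.0]) / 3.0             # stationary distribution over (A, B)

def word_probability(word):
    """Stationary probability of observing the given tuple of symbols."""
    v = pi
    for s in word:
        v = v @ T[s]
    return float(v.sum())

def entropy(p):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

K, L = 3, 2
pasts = list(itertools.product((0, 1), repeat=K))
futures = list(itertools.product((0, 1), repeat=L))
joint = np.array([[word_probability(p + f) for f in futures] for p in pasts])

H_past = entropy(joint.sum(axis=1))                          # block entropy of length-3 pasts
E_KL = H_past + entropy(joint.sum(axis=0)) - entropy(joint)  # I[past of length 3; future of length 2]
C_mu = entropy(pi)                                           # entropy of the two-state distribution

print(round(H_past, 4), round(E_KL, 4), round(C_mu, 4), round(100 * C_mu / H_past))
```

Because the process is a Markov chain of order one, the finite-length estimate with K = 3 and L = 2 already agrees with the exact predictive information to the precision quoted.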



C. Even Process: A hidden Markov chain 

Now, consider a hidden Markov process: the Even Process
[38], which is a stochastic process whose support (the
set of allowed sequences) is a symbolic dynamical system
called the Even system. The Even system generates all
binary strings consisting of blocks of an even number of
1s bounded by 0s. Having observed a process's sequences,
we say that a word (finite sequence of symbols) is forbid- 
den if it never occurs. A word is an irreducible forbidden 
word if it contains no proper subwords which are them- 
selves forbidden words. A system is sofic if its list of 
irreducible forbidden words is infinite. The Even system
is one such sofic system, since its set $\mathcal{F}$ of irreducible
forbidden words is infinite: $\mathcal{F} = \{01^{2n+1}0, n = 0, 1, \ldots\}$.
Note that no finite-order Markovian source can generate
this or, for that matter, any other strictly sofic system
[25]. The Even Process then associates probabilities with
each of the Even system's sequences by choosing a 0 or 1
with fair probability after generating either a 0 or a pair
of 1s. The result is a measure sofic process: a distribution
over a sofic system's sequences.

FIG. 3: OCF's behavior monitored in the information plane, $I[\mathcal{R}; \overrightarrow{X}^2]$ versus $I[\overleftarrow{X}^3; \mathcal{R}]$, for the Golden Mean Process. The
correct two-state model is found. Histories of length K = 3 were used, along with futures of length L = 2. The horizontal dashed
line is the full predictive information $\mathbf{E} \approx I[\overleftarrow{X}^3; \overrightarrow{X}^2] = I[\mathcal{S}; \overrightarrow{X}^2] \approx 0.25$ bits which, as seen, is an upper bound on $I[\mathcal{R}; \overrightarrow{X}^2]$. The
exact value is $\mathbf{E} = I[\overleftarrow{X}; \overrightarrow{X}] = 0.2516$ bits [38]. Similarly, the vertical dashed line is the block entropy $H[\overleftarrow{X}^3] \approx 2.25$ bits which
is an upper bound on the retrodictive information $I[\overleftarrow{X}^3; \mathcal{R}]$. The statistical complexity $C_\mu \approx 0.92$ bits, also an upper bound,
is labeled. The annealing rate was 0.952.

FIG. 4: Future conditional probabilities $P(\overrightarrow{X}^2 | \cdot)$ conditioned on causal states $\sigma \in \mathcal{S}$ (boxes) and on the OCF reconstructed
states $\rho \in \mathcal{R}$ (circles) for the Golden Mean Process. As an input to OCF, future conditional probabilities $P(\overrightarrow{X}^2 | \overleftarrow{x}^3)$ calculated
from histories of length K = 3 were used (crosses).
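The forbidden words of the Even system can be checked with a small two-state presentation of the Even Process. This is a sketch under assumptions: state A emits 0 (staying in A) or 1 (moving to B) with fair probability, and state B always emits the second 1 of a pair and returns to A. The matrices `T`, the vector `pi`, and the helper names are illustrative.

```python
import itertools
import numpy as np

# T[s][i, j] = P(emit symbol s and move to state j | current state i); states are (A, B).
T = {0: np.array([[0.5, 0.0], [0.0, 0.0]]),
     1: np.array([[0.0, 0.5], [1.0, 0.0]])}
pi = np.array([2.0, 1.0]) / 3.0             # stationary distribution over (A, B)

def word_probability(word):
    """Stationary probability of observing the given tuple of symbols."""
    v = pi
    for s in word:
        v = v @ T[s]
    return float(v.sum())

def entropy(p):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Words of the form 0 1^(2n+1) 0 should never occur; even blocks of 1s are allowed.
forbidden = [(0,) + (1,) * (2 * n + 1) + (0,) for n in range(3)]
allowed = [(0,) + (1,) * (2 * n) + (0,) for n in range(1, 4)]
print([word_probability(w) for w in forbidden])
print([word_probability(w) for w in allowed])

block3 = [word_probability(w) for w in itertools.product((0, 1), repeat=3)]
H_past = entropy(block3)                    # block entropy of length-3 words
print(round(H_past, 3))
```

The block entropy printed at the end can be compared with the value quoted for histories of length 3 in the Even Process figures.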

As in the previous example, for large $\lambda$, OCF applied to
the Even Process recovers a small, one-state model with
poor predictive quality; see Fig. 5. As $\lambda$ decreases there
are transitions to larger models that capture increasingly
more information about the future. (The numbers along
the curve again indicate the points of first transition to
more states.) With a three-state model OCF captures
the full predictive information at a model size of 56%
of the maximum. This model is exactly the causal-state
partition, as can be seen in Fig. 6 by comparing the future
conditional probabilities of the OCF model (circles)
to the true underlying causal states (boxes), which are
not known to the algorithm.

The correct ε-machine model of the Even Process has
four causal states: two transient and two recurrent. At
the finite past and future lengths used here, OCF picks
up only one of the transient states and the two recurrent
states. It also assigns probability to all three. This
increases the effective state entropy ($H[\mathcal{R}] \approx 1.48$ bits)
above the statistical complexity ($C_\mu = 0.92$ bits) which is
only a function of the two recurrent states, since asymptotically
($K \to \infty$) the transient states have zero probability.

FIG. 5: OCF's behavior inferring the Even Process: monitored in the information plane, $I[\mathcal{R}; \overrightarrow{X}^2]$ versus $I[\overleftarrow{X}^3; \mathcal{R}]$. Histories
of length K = 3 were used, along with futures of length L = 2. The horizontal dashed line is the full predictive information
$I[\overleftarrow{X}^3; \overrightarrow{X}^2] \approx 0.292$ bits which, as seen, is an upper bound on the estimates $I[\mathcal{R}; \overrightarrow{X}^2]$. Similarly, the vertical dashed line is the
block entropy $H[\overleftarrow{X}^3] \approx 2.585$ bits which is an upper bound on the retrodictive information $I[\overleftarrow{X}^3; \mathcal{R}]$.

FIG. 6: Future conditional probabilities $P(\overrightarrow{X}^2 | \cdot)$ conditioned on causal states $\sigma \in \mathcal{S}$ (boxes) and on the OCF-reconstructed
states $\rho \in \mathcal{R}$ (circles) for the Even Process. As an input to OCF, future conditional probabilities $P(\overrightarrow{X}^2 | \overleftarrow{x}^3)$
calculated from histories of length K = 3 were used (crosses).

There is an important lesson in this example for general 
time-series modeling, not just OCF. Correct inference of 
even finite-state, but measure-sofic processes requires us- 
ing hidden Markov models. Related consequences of this, 
and one resolution, are discussed at some length for esti- 
mating "nonhidden" Markov models of sofic processes in 
Ref. ESI. 



D. Random Random XOR: A structurally complex 
process 

The previous examples demonstrated our main theo- 
retical result: In the limit in which it becomes crucial 
to make the prediction error very small, at the expense 



of the model size, the OCF algorithm captures all of the 
structure inherent in the process by recovering the causal- 
state partition. 

However, if we allow (or prefer) a model with some 
finite prediction error, then we can make the model sub- 
stantially smaller. We have already seen what happens 
in the worst case scenario, for a periodic process. There, 
each predictive bit costs exactly one bit in terms of model 
size. However, for highly structured processes, there exist 
situations in which one can compress the model substan- 
tially at essentially no loss in terms of predictive power. 
(This is called causal compressibility [20].) The Even
Process is an example of such an information source: The
statistical complexity $H[\mathcal{S}]$ of the causal-state partition is
smaller than the total available historical information,
the entropy of the past $H[\overleftarrow{X}^K]$.

Now, we study a process that requires keeping all of the 
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historical information to be maximally predictive, which 

is the same as stating $C_\mu(\mathcal{R}) = H[\overleftarrow{X}^K]$. (Precisely, we
mean given the finite past and future lengths we use.)
Nonetheless, there is a systematic ordering of models of 
different size and different predictive power given by the 
rate-distortion curve, as we change the parameter $\lambda$ that
controls how much of the future fluctuations the model 
considers to be random; i.e., which fluctuations are con- 
sidered indistinguishable. Naturally, the trade-off, and 
therefore the shape of the rate-distortion curve, depends 
on and reflects the source's organization. 

As an example, consider the random-random XOR
(RRXOR) process, which consists of two successive random
symbols chosen to be 0 or 1 with equal probability
and a third symbol that is the logical Exclusive-OR
(XOR) of the two previous. The RRXOR process can be
represented by a hidden Markov chain with five recurrent
causal states, but having a very large total number
of causal states. There are 36 causal states, most (31) of
which describe a complicated transient structure [25]. As
such, it is a structurally complex process that an analyst
may wish to approximate with a smaller set of states.
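The generating rule just described can be sketched in a few lines of Python. This is only an illustrative sampler, not the authors' code: the function name `rrxor`, the `seed` argument, and the choice to start every sample at the first random symbol of a triple (rather than at a random phase, as a stationary sample would require) are assumptions made here for simplicity.

```python
import random

def rrxor(n_symbols, seed=None):
    """Sample n_symbols from an RRXOR-style source.

    Each block of three symbols is (a, b, a XOR b), where a and b are
    independent fair coin flips. The sequence always starts at the
    beginning of a block, which is a simplifying assumption.
    """
    rng = random.Random(seed)
    out = []
    while len(out) < n_symbols:
        a = rng.randint(0, 1)      # first random symbol
        b = rng.randint(0, 1)      # second random symbol
        out.extend([a, b, a ^ b])  # third symbol is the XOR of the two
    return out[:n_symbols]

sample = rrxor(30, seed=1)
```

An observer who does not know the phase of the three-symbol block must infer it from the data, which is one way to see why the transient structure of this process is so large.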

Figure 7 shows the information plane, which specifies
how OCF trades off structure for prediction error as a
function of model complexity for the RRXOR process.
The number of effective states (again first occurrences
are denoted by integers along the curve) increases with
model complexity. At a history length of K = 3 and
future length of L = 2, the process has eight underlying
causal states, which are found by OCF in the $\lambda \to 0$
limit. The corresponding future conditional probability
distributions are shown in Fig. 8.

The RRXOR process has a structure that does not allow
for substantial compression. Fig. 7 shows that the effective
statistical complexity of the causal-state partition
is equal to the full entropy of the past: $C_\mu(\mathcal{R}) = H[\overleftarrow{X}^3]$.
So, at K = 3, unlike the Even and Golden Mean Processes,
the RRXOR process is not compressible. With
half (4) of the number of states, however, OCF reconstructs
a model that is only 33% as large, while capturing
50% of the information about the future. The corresponding
conditional future probabilities of the (best)
four-state model are shown in Fig. 9. They are mixtures
of pairs of the eight causal states.

The rate-distortion curve informs the modeler about
the (best possible) efficiency of predictive power to model
complexity: $I[\mathcal{R}; \overrightarrow{X}]/I[\overleftarrow{X}; \mathcal{R}]$. This is useful, for example,
if there are constraints on the maximum model size
or, vice versa, on the minimum prediction error. For ex- 
ample, if we require a model of RRXOR to be 90% infor- 
mative about the future, then we can read off the curve 
that this can be achieved at 70% of the model complexity.
Generally, as $\lambda$ decreases, phase transitions occur to
models with a larger number of effective states [3"T] . 



VI. OPTIMAL CAUSAL ESTIMATION: 
FINITE-DATA FLUCTUATIONS 

In real world applications, we do not know a pro- 
cess's underlying probability density, but instead must 
estimate it from a finite time series that we are given. 
Let that time series be of length T and let us estimate 
the joint distribution of pasts (of length K) and futures 
(of length L) via a histogram calculated using a sliding 
window. Altogether we have $M = T - (K + L - 1)$
observations. The resulting estimate $\hat{P}(\overleftarrow{X}^K; \overrightarrow{X}^L)$ will deviate
from the true $P(\overleftarrow{X}^K; \overrightarrow{X}^L)$ by $\Delta(\overleftarrow{X}^K, \overrightarrow{X}^L)$. This
leads to an overestimate of the mutual information [U]:
$\hat{I}[\overleftarrow{X}^K; \overrightarrow{X}^L] > I[\overleftarrow{X}^K; \overrightarrow{X}^L]$. Evaluating the objective function
at this estimate may lead one to capture variations
that are due to the sampling noise and not to the process's
underlying structure; i.e., OCF may over-fit. That
is, the underlying process may appear to have a larger
number $N_c$ of causal states than the true number.
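The sliding-window histogram and the resulting plug-in mutual information can be sketched as follows. This is a minimal illustration under stated assumptions, not the estimator used for the figures: the function names `sliding_window_counts` and `plugin_mutual_information` are hypothetical, and the period-two test sequence is chosen only because its answer (1 bit) is easy to verify by hand.

```python
from collections import Counter
from math import log2

def sliding_window_counts(series, K, L):
    """Count (past, future) pairs using a sliding window of width K + L.

    A series of length T yields M = T - (K + L - 1) observations.
    """
    counts = Counter()
    for t in range(len(series) - (K + L) + 1):
        past = tuple(series[t:t + K])
        future = tuple(series[t + K:t + K + L])
        counts[(past, future)] += 1
    return counts

def plugin_mutual_information(counts):
    """Mutual information (bits) of the empirical joint distribution."""
    M = sum(counts.values())
    past_counts, future_counts = Counter(), Counter()
    for (past, future), c in counts.items():
        past_counts[past] += c
        future_counts[future] += c
    return sum(
        (c / M) * log2(c * M / (past_counts[past] * future_counts[future]))
        for (past, future), c in counts.items()
    )

# Period-two sequence of length T = 100 with K = 3, L = 2.
counts = sliding_window_counts([0, 1] * 50, K=3, L=2)
M = sum(counts.values())                      # T - (K + L - 1) = 96
info = plugin_mutual_information(counts)      # the past fixes the future
```

For a stochastic source, the same plug-in value computed from a short series is biased upward on average, which is the over-fitting risk described above.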

Following Ref. [17], we argue that this effect can
be counteracted by subtracting from $F[\mathcal{R}]$ a model-complexity
control term that approximates the error we
make by calculating the estimate $\hat{F}[\mathcal{R}]$ rather than the
true $F[\mathcal{R}]$. If we are willing to assume that $M$ is large
enough, so that the deviation $\Delta(\overleftarrow{X}^K, \overrightarrow{X}^L)$ is a small perturbation,
then the error can be approximated by [17, Eq. (5.8)]:

$$\mathcal{E}(N_c) = \frac{k^L - 1}{2 \ln(2)} \, \frac{N_c}{M} \tag{25}$$



in the low-temperature regime, $\lambda \to 0$. Recall that $k^L$
is the total number of possible futures for alphabet size
$k$. The optimal number $N^*$ of hidden states is then the
one for which the largest amount of mutual information
is shared with the future, corrected by this error:



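$$N^* = \arg\max_{N_c} \; I[\overleftarrow{X}^K; \overrightarrow{X}^L]^{\rm corrected}_{\lambda \to 0}(N_c) \tag{26}$$

with

$$I[\overleftarrow{X}^K; \overrightarrow{X}^L]^{\rm corrected}_{\lambda \to 0}(N_c) = I[\overleftarrow{X}^K; \overrightarrow{X}^L]_{\lambda \to 0}(N_c) - \mathcal{E}(N_c) .$$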



This correction generalizes OCF to optimal causal esti- 
mation (OCE), a procedure that simultaneously accounts 
for the trade-off between structure, approximation, and 
sample fluctuations. 
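The selection rule of Eqs. (25) and (26) can be sketched as below. This assumes Eq. (25) reads $\mathcal{E}(N_c) = (k^L - 1) N_c / (2 \ln(2) M)$; the function names and the information values in `hypothetical_info` are made up purely for illustration and are not results from this paper.

```python
from math import log

def finite_sample_error(n_states, k, L, M):
    """Approximate overestimate of the information (bits), per Eq. (25).

    k is the alphabet size, L the future length, M the number of
    observations, n_states the number N_c of model states.
    """
    return (k ** L - 1) / (2.0 * log(2)) * n_states / M

def select_num_states(info_by_n, k, L, M):
    """Return the N_c maximizing the corrected information, per Eq. (26)."""
    corrected = {
        n: info - finite_sample_error(n, k, L, M)
        for n, info in info_by_n.items()
    }
    best = max(corrected, key=corrected.get)
    return best, corrected

# Hypothetical uncorrected information values (bits) for N_c = 1..4.
hypothetical_info = {1: 0.0, 2: 0.25, 3: 0.26, 4: 0.265}
M = 100 - (3 + 2 - 1)   # T = 100, K = 3, L = 2
best, corrected = select_num_states(hypothetical_info, k=2, L=2, M=M)
```

Because the penalty grows linearly in $N_c$ while the uncorrected information saturates, the corrected curve peaks at a finite number of states; in this made-up example the small gains beyond two states do not pay for the added penalty.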

We illustrate OCE on the Golden Mean and Even
Processes studied in Sec. V. With the correct number
of underlying states, they can be predicted at a substantial
compression. Figures 10 and 12 show the mutual
information $I[\mathcal{R}; \overrightarrow{X}^2]$ versus the number $N_c$ of inferred
states, with statistics estimated from time series
of length T = 100. The graphs compare the mutual
information $I[\mathcal{R}; \overrightarrow{X}^2]_{\lambda \to 0}$ evaluated using the estimate
$\hat{P}(\overrightarrow{X}^2; \overleftarrow{X}^3)$ (upper curve) to the corrected information
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FIG. 7: Prediction versus structure trade-off under OCF for the random-random XOR (RRXOR) process, as monitored in the 
information plane. As above, the horizontal dashed line is the predictive information ($\approx 0.230$ bits) and the vertical dashed
line is the block entropy ($\approx 2.981$ bits). Histories of length K = 3 were used, along with futures of length L = 2. The asterisk
and lines correspond to the text: they serve to show the predictive power and the complexity of the best four-state model,
the future conditional probabilities of which are depicted in Fig. 9.
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FIG. 8: Future conditional probabilities P(X 2 |-) for the RRXOR process: the 8-state approximation (circles) finds the causal 

states (boxes). For example, the heavier dashed line (purple) shows $P(\overrightarrow{X}^2|\rho) = (1/4, 1/2, 1/4, 0)$. Histories of length K = 3
were used, along with futures of length L = 2.



$I[\mathcal{R}; \overrightarrow{X}^2]^{\rm corrected}_{\lambda \to 0}$ calculated by subtracting the approximated
error Eq. (25) with $k^L = 4$ and $M = 96$ (lower
curve).

We see that the corrected information curves peak at,
and thereby select, models with two states for the Golden
Mean Process and three states for the Even Process. This
corresponds with the true number of causal states, as we
know from above (Sec. V) for the two processes. The true
statistical complexity for both processes is $C_\mu \approx 0.91830$,
while those estimated via OCE are $C_\mu \approx 0.93773$ and
$C_\mu \approx 1.30262$, respectively. (Recall that the overestimate
for the latter was explained in Sec. V C.)
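The quoted true value $C_\mu \approx 0.91830$ can be checked with a short calculation. The sketch below assumes that, for both processes, the two recurrent causal states have stationary probabilities (2/3, 1/3); this is the standard result for the unbiased Golden Mean and Even Processes but is not stated explicitly in this section, so treat it as an assumption. The helper name `entropy_bits` is illustrative.

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Assumed stationary distribution over the two recurrent causal states.
c_mu = entropy_bits([2 / 3, 1 / 3])
```

Under that assumption, `c_mu` agrees with the quoted 0.91830 bits to five decimal places.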



Figures 11 and 13 show the OCE future conditional
probabilities corresponding to the (optimal) two- and
three-state approximations, respectively. The inputs to
OCE are the future conditional probabilities given the
histories, $P(\overrightarrow{X}^2|\overleftarrow{x}^3)$ (crosses), which are estimated from
the full historical information. Those future conditional
probabilities are corrupted by sampling errors due to the
finite data set size and differ from the true future conditional
probabilities (squares).

Compare the OCE future conditional probabilities 
(circles) to the true future conditional probabilities 
(squares), calculated with the knowledge of the causal 
states. (The latter, of course, is not available to the OCE 
algorithm.) In the case of the Golden Mean Process, OCE approximates
the correct future conditional probabilities. For
the Even Process there is more spread in the estimated 
OCE future conditional probabilities. Nonetheless, OCE 
reduced the fluctuations in its inputs and corrected in 
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FIG. 9: Morphs P(X 2 |-) for the RRXOR process: the 4-state approximation (circles and colored lines: state 1 - cyan/full, 2 - 
green/full, 3 - blue/dashed, 4 - purple/dashed) compared to causal states (boxes). Histories of length K = 3 were used, along 
with futures of length L = 2.




FIG. 10: Information I captured about the future versus the number $N_c$ of reconstructed states, with statistics estimated from
a length T = 100 time series sample from the Golden Mean Process. Upper line: plotted on the vertical axis is $I[\mathcal{R}; \overrightarrow{X}^2]_{\lambda \to 0}$ (not
corrected); lower line: plotted on the vertical axis is the quantity $I[\mathcal{R}; \overrightarrow{X}^2]^{\rm corrected}_{\lambda \to 0}$, which is the retained predictive information,
but corrected for estimation errors due to finite sample size. The dashed line indicates the actual upper bound on the predictive
information $I[\overleftarrow{X}^K; \mathcal{R}]$, for comparison. This value is not known to the algorithm; it is computed from the true process statistics.
Histories of length K = 3 and futures of length L = 2 were used. The asterisk denotes the optimal number ($N_c = 2$) of effective
states.



the direction of the true underlying future conditional 
probabilities. 



VII. CONCLUSION 

We analyzed an information-theoretic approach to 
causal modeling in two distinct cases: (i) optimal causal 
filtering (OCF), where we have access to the process 
statistics and desire to capture the process's structure up 
to some level of approximation, and (ii) optimal causal
estimation (OCE), in which, in addition, finite-data fluctuations
need to be traded off against approximation error
and structure. The objective function used in both cases



follows from very simple first principles of information 
processing and causal modeling: a good model should 
minimize prediction error at minimal model complexity. 
The resulting principle of using small, predictive models 
follows from minimal prior knowledge that, in particu- 
lar, makes no structural assumptions about a process's 
architecture: Find variables that do the best at causal 
shielding. 

OCF stands in contrast with other approaches. Hidden 
Markov modeling, for example, assumes a set of states 
and an architecture [30]. OCF finds these states from
the given data. In minimum description length modeling, 
to mention another contrast, the model complexity of a 
stochastic source diverges (logarithmically) with the data 






FIG. 11: OCE's best two-state approximated future conditional probabilities (circles) for the Golden Mean Process. Compared
to true (unknown) future conditional probabilities (squares). The OCE inputs are the estimates of $P(\overrightarrow{X}^2|\overleftarrow{x}^3)$ (crosses).




FIG. 12: Information I captured about the future versus the number $N_c$ of reconstructed states, with statistics estimated from
a length T = 100 time series sample from the Even Process. Upper line: $I[\mathcal{R}; \overrightarrow{X}^2]_{\lambda \to 0}$, not corrected; lower line: $I[\mathcal{R}; \overrightarrow{X}^2]^{\rm corrected}_{\lambda \to 0}$,
corrected for estimation error due to finite sample size. The dashed line indicates the actual upper bound on the predictive
information, for comparison. This value is not known to the algorithm; it is computed from the true process statistics. Histories
of length K = 3 and futures of length L = 2 were used. The asterisk denotes the optimal number ($N_c = 3$) of effective states.



set size [H], as happens even when modeling the ideal 
random process of a fair coin. OCF, however, finds the 
simplest (smallest) models. 

Our main result is that OCF reconstructs the causal- 
state partition, a representation previously known from 
computational mechanics that captures a process's causal 
architecture and that allows important system proper- 
ties, such as entropy rate and stored information, to 
be calculated [22]. This result is important as it gives 
a structural meaning to the solutions of the optimiza- 
tion procedure specified by the causal inference objec- 
tive function. We have shown that in the context of 
time series modeling, where there is a natural relevant 
variable (the future), the IB approach [16] recovers the 
unique minimal sufficient statistic — the causal states — 
in the limit in which prediction is paramount to com- 
pression. Altogether, this allows us to go beyond plausi- 
bility arguments for the information-theoretic objective 



function that have been used. We showed that this way 
(OCI) of phrasing the causal inference problem in terms 
of causal shielding results in a representation that is a suf- 
ficient statistic and minimal and, moreover, reflects the 
structure of the process that generated the data. OCI 
does so in a way that is meaningful and well grounded in 
physics and nonlinear dynamics. The optimal solutions 
to balancing prediction and model complexity take on 
meaning — asymptotically, they are the causal states. 

The results also contribute to computational mechan- 
ics: The continuous trade-off allows one to extend the 
deterministic history-to-state assignments that compu- 
tational mechanics introduced to "soft" partitions of his- 
tories. The theory gives a principled way of constructing 
stochastic approximations of the ideal causal architec- 
ture. The resulting approximated models can be sub- 
stantially smaller and so will be useful in a number of 
applications. 
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FIG. 13: OCE's best three-state approximated future conditional probabilities (circles) for the Even Process (d). Compared 

to true (unknown) future conditional probabilities (squares). The OCE inputs are the estimates of $P(\overrightarrow{X}^2|\overleftarrow{x}^3)$ (crosses).



Finally, we showed how OCF can be adapted to correct 
for finite-data sampling fluctuations and so not over-fit. 
This reduces the tendency to see structure in noise. OCE 
finds the correct number of hidden causal states. This 
renders the method useful for application to real data. 
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